Abstract. This paper investigates the stochastic fluctuations of the number of copies of a given protein in a cell. This problem has already been addressed in the past, and closed-form expressions of the mean and variance have been obtained for a simplified stochastic model of gene expression. These results were obtained under the assumption that the durations of all the protein production steps are exponentially distributed. In such a setting, a Markovian approach (via Fokker-Planck equations) is used to derive analytic formulas for the mean and the variance of the number of proteins at equilibrium. This assumption is, however, not totally satisfactory from a modeling point of view, since the distribution of the duration of some steps is more likely to be Gaussian, if not almost deterministic. In that case, Markovian methods can no longer be used. A finer characterization of the fluctuations of the number of proteins is therefore of primary interest to understand the general economy of the cell. In this paper, we propose a new approach, based on marked Poisson point processes, which makes it possible to remove the exponential assumption. This is applied in the framework of the classical three-stage models of the literature: transcription, translation and degradation. The interest of the method is shown by recovering the classical results under the assumption that all the durations are exponentially distributed, but also by deriving new analytic formulas when some of the distributions are no longer exponential. Our results show in particular that the exponential assumption may, surprisingly, significantly underestimate the variance of the number of proteins when some steps are in fact not exponentially distributed. This counter-intuitive result stresses the importance of the statistical assumptions in the protein production process. Finally, our approach can also be used to consider more detailed models of gene expression.
1. Introduction

The aim of the present work is to revisit and generalize the current mathematical results concerning the properties of intrinsic noise in gene expression. The stochastic characterisation of gene expression in the protein production process was studied theoretically, by means of stochastic models, in the late 70s by Berg [1] and Rigney [8], and reviewed recently by Paulsson [5]. For a long period of time, it was not possible to compare the theoretical results with real data because of the lack of appropriate laboratory techniques. In the last two decades, the introduction of reliable expression reporter techniques and the use of fluorescent reporters, such as GFP (Green Fluorescent Protein), have allowed observations in live cells and the experimental quantification of protein production at the single-cell level. See Taniguchi et al. [12] for the experimental characterization of a large number of messengers and proteins of E. coli.

The good qualitative agreement observed between experiments and the predictions of these earlier stochastic models has stimulated further investigations to take into account the statistical characteristics of the phenomena involved in the protein production process. In this domain, the variance of the number of copies of a cellular component in the cell is a key indicator of the efficiency of a production strategy, since it gives a measure of the fluctuation of the resources of the cell consumed by the production process. Clearly, this characteristic is directly affected by model design and statistical assumptions.

As in previous works, see Paulsson [5] and Swain, it is especially important to derive explicit analytical formulas for the variance in order to assess the impact of the key parameters and, consequently, to get a biological interpretation of the results obtained.

As will be seen, the introduction of more realistic statistical assumptions leads to several technical difficulties, the main one being that the classical PDE approach (Fokker-Planck equations) used in the literature can no longer be applied. In the present paper, using an approach based on marked Poisson point processes, we relax some statistical assumptions of the earlier stochastic models and obtain general results for a large class of stochastic models. In the biological context, we are then able to derive closed-form expressions of the mean and variance of the main characteristics of the production process.

1.1. Biological Context and Model Motivations. We first provide a few biological insights about protein production in living organisms. Gene expression is the process by which genetic information is synthesised into a functional product, the proteins. The production of proteins is the most important cellular activity, both for its functional role and for its high associated cost in terms of resources (in prokaryotic cells it can reach up to 85% of the cellular resources). In particular, in an E. coli bacterium there are about $3.6 \times 10^6$ proteins of approximately 2000 different types, with a large variability in concentration depending on their types: from a few dozen up to $10^5$.

The information flow from DNA genes to proteins is a fundamental process, common to all living organisms, and is composed of two main elementary processes: transcription and translation. During the transcription process, the RNA polymerase binds to an active gene associated with a specific protein and makes a complementary copy of a specific DNA sequence, a messenger RNA (mRNA). Each mRNA, which is a long chain of nucleotides, is a chemical "blueprint" for a particular protein. The translation of the messenger into a polypeptide chain is achieved by a large complex molecule, the ribosome, with the help of some accessory factors such as the elongation factors. During translation, the ribosome binds to the messenger and builds the polypeptide chain using the mRNA as a template. In more detail, to each mRNA codon, a triplet of nucleotides, corresponds a specific amino acid, which is the fundamental component of proteins. The polypeptide chain of amino acids folds, spontaneously or with the help of chaperones, into its functional three-dimensional structure.

Gene expression is a highly stochastic process and results from the realization of a very large number of elementary stochastic processes of different natures. Thermal excitation affects many processes, since it implies for example the free diffusion in the cytoplasm, in which particles behave basically as if they were immersed in a viscous fluid. To a first approximation, three fundamental mechanisms are combined in protein production. The first is the pairing of two cellular components freely diffusing through the cytoplasm, and is a direct consequence of the diffusion. The second mechanism is the "spontaneous" rupture of the binding and the release of the two components as the result of thermal excitation. The last main stochastic process involved is an active one, since it consumes energy, and corresponds to the processing capability of both polymerase and ribosome. The active processes associated with polymerases and ribosomes are highly sophisticated steps, including for example dedicated proofreading mechanisms. In order to proceed to transcription initiation, gene expression needs a successful binding of the polymerase to a specific DNA motif. After the initiation step, the messenger chain is built through a series of specific stochastic processes, in which the polymerase recruits one of the four nucleotides in accordance with the DNA template. A similar description applies to the translation step. In particular, protein elongation is an iterative energy-consuming procedure in which each codon of the messenger chain is paired with a particular tRNA, which adds a new amino acid to the growing protein chain by means of the ribosome.

In summary, most of the elementary processes can be schematically seen as the encounter of two components in a viscous fluid. However, the classical approach to gene expression modeling is to group those elementary processes into critical steps such as initiation, elongation and degradation, which are common to both transcription and translation.

1.2. Mathematical Model. The corresponding mathematical model is now described. For all the reasons given so far, the total number of copies of a given protein in the cell is a random variable $P$. The cell can thus be thought of as a system that produces a given protein with average concentration $\mathbb{E}[P]$, where $\mathbb{E}[X]$ denotes the expected value of a random variable $X$. The protein concentration, which can vary by several orders of magnitude depending on the protein type, is directly connected with the various parameters through a quite simple formula, as will be shown in the sequel. The main objective of the paper is to derive an explicit representation of the variance of the number of proteins in terms of the various parameters of the protein production process.

Gene activation. The gene activation involves complex processes, among which the main ones are the association/dissociation of a repressor.
Usually the whole process is described as a telegraph process for which a transition from the inactive state 0 to the active state 1 occurs at rate $\lambda_1^+$ and, similarly, from state 1 to state 0 at rate $\lambda_1^-$. Here the fundamental assumption is that the distribution of the duration of these steps is exponential. In a prokaryotic cell, there may be several copies of a specific gene, and this fact has been included in a few models in the past years, see Paulsson [5]. Nevertheless, since we are interested in the variance of the number of proteins, we will assume in the following sections that there is only one copy of the gene. The analogous result for the case with multiple copies is straightforward to obtain since, by independence, the variance of the number of proteins is proportional to the number of copies of the gene.

Transcription. An RNA polymerase binds to an active gene after an exponentially distributed time with rate $\lambda_2$. This effective rate measures the frequency of transcription initiation and takes into account several physical parameters including, for example, the affinity between the specific gene and the polymerase. The distribution $F_2$ on $\mathbb{R}_+$ of the lifetime $\sigma_2$ of an mRNA is assumed to be general.

Translation. Similarly, the binding of a ribosome to an mRNA occurs after an exponentially distributed time with rate $\lambda_3$, which measures the frequency of translation initiation and also includes the affinity between messenger and ribosome. The distribution $F_3$ of the lifetime $\sigma_3$ of the protein is also general. The decay of the protein concentration occurs for two main reasons: proteolysis, i.e. the degradation of the protein into amino acids, and cellular dilution, due to the increase of the cellular volume of the bacterium during the exponential growth phase.

This paper focuses on the process of production of a given protein. For this reason, the interaction with the production process of other proteins is not considered.

1.3. Literature: the three-stage model. This is the fundamental model used to describe gene expression in the literature. These key steps can already be found in the first systematic and accurate studies of stochastic models for gene expression, such as Rigney [7] and Berg [1]. In recent years the three-stage model has been used as the fundamental structure in the most well-known works: Shahrezaei and Swain [10], Paulsson [5] and Peccoud and Ycart [6].

The promoter of the gene corresponding to the specific protein of interest can be in one of two possible states: active or inactive. In these studies, transcription, translation and the degradation of proteins and messengers are modeled as first-order chemical reactions, i.e. their durations are supposed to be exponentially distributed (or geometrically distributed in a discrete time setting). See Paulsson [5] for an extensive survey on the subject. With the above notations, this amounts to saying that $\sigma_2$ and $\sigma_3$ are exponentially distributed.

The assumption of exponentially distributed durations of the various phases of the three-stage model leads naturally to a Markovian modeling. The overall dynamics of gene activation can be described, see Paulsson [5], by the random variable $Y(t) \in \{0, 1\}$, where $Y(t) = 1$ indicates that the gene is active at time $t$, while $Y(t) = 0$ if it is inactive. Recall that we consider, without loss of generality, only the one-gene case. If we denote by $N_2(t)$ the number of mRNAs and by $N_3(t)$ the number of proteins, then it turns out that $(X(t)) = (Y(t), N_2(t), N_3(t))$ is a Markov process with values in $\{0, 1\} \times \mathbb{N}^2$. This representation is common to most of the models of the literature. Some of them have, in fact, a lower dimensional state space because of assumptions on the number of mRNAs, for example. As a consequence, the general theory of Markov processes gives a system of linear differential equations of order 1, the Fokker-Planck equations, for the functions $p(t, (y, n_2, n_3))$, the probability that $X(t)$ is in state $(y, n_2, n_3)$ at time $t$. The system of equations has the general form

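$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t}\, p(t, (y, n_2, n_3)) ={}& \lambda_1(y)\, p(t, (1 - y, n_2, n_3)) + \lambda_2\, p(t, (y, n_2 - 1, n_3))\, \mathbb{1}_{\{y = 1\}} \\
&+ \alpha(n_2)\, p(t, (y, n_2, n_3 - 1)) + \beta(n_3)\, p(t, (y, n_2, n_3)).
\end{aligned} \tag{1}$$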

The system has a unique stable point $(\pi(y, n_2, n_3),\ (y, n_2, n_3) \in \{0, 1\} \times \mathbb{N}^2)$, the invariant distribution of the Markov process, whose explicit expression is not known, to the best of our knowledge. Nevertheless, since the coefficients $\alpha(n)$ and $\beta(n)$ are linear with respect to $n$, the moments of the invariant distribution satisfy a recurrence equation. This equation is not completely simple, but it gives an explicit expression for the first two moments and, in particular, for the variance, which is the key quantity to investigate in these stochastic models. This is the main theoretical result used in many papers in the literature, see Rigney [5]. It should be kept in mind that this approach is possible only under the assumption that the durations of all the main steps (like the production time of an mRNA or of a protein) are exponentially distributed. This assumption is now discussed.
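
As an illustration of the Markovian description above, the process (gene state, number of mRNAs, number of proteins) can be simulated directly when all durations are exponential. The following is a minimal sketch using Gillespie's stochastic simulation algorithm, which is not the method of the literature discussed here (that relies on Fokker-Planck equations). The function name, the parameter names (`lam_on` and `lam_off` for the activation and inactivation rates, `mu2` and `mu3` for the degradation rates of mRNAs and proteins) and the burn-in period are assumptions made for this example.

```python
import random


def gillespie_three_stage(lam_on, lam_off, lam2, mu2, lam3, mu3,
                          t_end, burn_in=50.0, seed=0):
    """Time-averaged numbers of mRNAs and proteins in the Markovian
    three-stage model (all durations exponential). Requires t_end > burn_in."""
    rng = random.Random(seed)
    t, y, n2, n3 = 0.0, 0, 0, 0       # time, gene state, mRNAs, proteins
    acc_t = acc_n2 = acc_n3 = 0.0     # accumulators for the time averages
    while t < t_end:
        rates = [
            lam_on if y == 0 else lam_off,  # the gene switches state
            lam2 * y,                       # transcription (active gene only)
            mu2 * n2,                       # mRNA degradation, linear in n2
            lam3 * n2,                      # translation, linear in n2
            mu3 * n3,                       # protein decay, linear in n3
        ]
        total = sum(rates)
        dt = rng.expovariate(total)         # holding time in the current state
        hold = min(dt, t_end - t)
        # Part of the holding interval that lies after the burn-in period.
        w = max(0.0, t + hold - max(t, burn_in))
        acc_t += w
        acc_n2 += w * n2
        acc_n3 += w * n3
        t += dt
        if t >= t_end:
            break
        # Pick the next event with probability proportional to its rate.
        r = rng.random() * total
        event = 0
        while event < 4 and r >= rates[event]:
            r -= rates[event]
            event += 1
        if event == 0:
            y = 1 - y
        elif event == 1:
            n2 += 1
        elif event == 2 and n2 > 0:
            n2 -= 1
        elif event == 3:
            n3 += 1
        elif event == 4 and n3 > 0:
            n3 -= 1
    return acc_n2 / acc_t, acc_n3 / acc_t
```

For instance, `gillespie_three_stage(1.0, 1.0, 5.0, 1.0, 2.0, 0.5, t_end=20000.0)` returns time averages that can be compared with the closed-form means derived later in the paper.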

1.4. Statistical issues: the exponential assumption. We refer to the exponential assumption when the time to produce a particular cellular component and its lifetime, i.e. $\sigma_2$ and $\sigma_3$, are assumed to be exponentially distributed.

The exponential assumption is natural in the following simple situation: a large number of trials is necessary to achieve some goal (like the binding of some element on the DNA or on an mRNA), and each trial requires some duration $D$ and succeeds with probability $\alpha$. If $G_\alpha$ is the total number of attempts needed to succeed, i.e. $\mathbb{P}(G_\alpha > n) = (1 - \alpha)^n$, then

$$\lim_{\alpha \to 0} \mathbb{P}(\alpha G_\alpha > x) = e^{-x},$$

in other words, if $\alpha$ is small then $\alpha G_\alpha \sim E_1$, where $E_1$ is an exponential random variable with mean 1. Consequently, the total duration of time necessary to realize the objective is, due to the averaging of the law of large numbers ($G_\alpha$ is large),

$$\sum_{i=1}^{G_\alpha} D_i \sim G_\alpha\, \mathbb{E}(D) \sim \frac{\mathbb{E}(D)}{\alpha}\, E_1$$

and is therefore exponentially distributed with mean $\mathbb{E}(D)/\alpha$.

As can be seen, this scheme may properly describe the time required for a successful binding of an RNA polymerase to the gene, or of a ribosome to an mRNA.
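
The convergence of a geometric number of trials to an exponential duration can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the trial durations are assumed to be uniform on [0, 2] (mean 1), a choice made for this example and not taken from the paper. It evaluates the exact tail of the rescaled number of attempts and simulates the total duration.

```python
import math
import random


def geometric_tail(alpha, x):
    """Exact P(alpha * G > x) when P(G > n) = (1 - alpha)**n."""
    return (1.0 - alpha) ** math.floor(x / alpha)


def total_duration(alpha, rng):
    """Sum of G i.i.d. trial durations, here uniform on [0, 2] (mean 1)."""
    # Inversion sampling of the number of attempts: P(G > n) = (1 - alpha)**n.
    u = 1.0 - rng.random()                     # uniform on (0, 1]
    attempts = 1 + math.floor(math.log(u) / math.log(1.0 - alpha))
    return sum(rng.uniform(0.0, 2.0) for _ in range(attempts))


def simulate(alpha=0.01, n_samples=5000, seed=1):
    """Empirical mean of the total duration and fraction of samples above it."""
    rng = random.Random(seed)
    samples = [total_duration(alpha, rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    frac_above_mean = sum(s > mean for s in samples) / n_samples
    return mean, frac_above_mean
```

For a small success probability, `geometric_tail(alpha, x)` should be close to `math.exp(-x)`, the empirical mean close to the mean trial duration divided by `alpha`, and the fraction of samples above the mean close to `math.exp(-1)`, as for an exponential random variable.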

It should be noted that this assumption may not be true if one considers the elongation time of an mRNA or of a protein chain. In particular, during polypeptide elongation, each tRNA, transporting a specific amino acid, has to bind to the ribosome. Even if the duration of each of these steps is indeed exponentially distributed, elongation requires on average 100-300 such steps, one for each amino acid, so that the distribution of the duration of the whole process is no longer exponential. To a first approximation, because of the large number of elongation steps, a deterministic elongation time with a small Gaussian perturbation should be considered. One of the main contributions of this paper is to show, via convenient mathematical tools, that the assumption on the distributions of $\sigma_2$ and $\sigma_3$ has an important impact on the qualitative properties of the protein production process.
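
The concentration of the elongation time around its mean can be illustrated numerically. The sketch below assumes, for illustration only, that the elongation steps are i.i.d. exponential with a common rate (the names `n_steps` and `step_rate` are hypothetical). The sum of $n$ such steps has an Erlang distribution whose coefficient of variation is $1/\sqrt{n}$, to be compared with 1 for a single exponential step.

```python
import math
import random
import statistics


def erlang_cv(n_steps):
    """Coefficient of variation of a sum of n_steps i.i.d. exponential steps."""
    return 1.0 / math.sqrt(n_steps)


def simulate_elongation(n_steps, step_rate, n_samples=2000, seed=2):
    """Empirical mean and coefficient of variation of the total elongation time."""
    rng = random.Random(seed)
    totals = [sum(rng.expovariate(step_rate) for _ in range(n_steps))
              for _ in range(n_samples)]
    mean = statistics.mean(totals)
    cv = statistics.pstdev(totals) / mean
    return mean, cv
```

With 200 steps the coefficient of variation is about 0.07, so the total duration is close to deterministic, in line with the Gaussian approximation mentioned above.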

1.5. A Marked Point Process Description of Protein Production. If the distributions of $\sigma_2$ and $\sigma_3$ are not exponential, a Markovian description of the system is no longer possible, since the residual lifetimes of all the components have to be included in the state variable. In this case, to get a possible analogue of the system of equations (1), an infinite dimensional state space would be required. For this reason, there is little hope of using, as has been done up to now in the literature, the equivalent of Fokker-Planck equations to get explicit results like the first moments at equilibrium.

Our approach consists in representing the state of the system as a functional of several marked point processes. See the appendix for the general definitions and results concerning these processes. Although it may be difficult to obtain a PDE formulation of the problem, we can get a quite detailed description of the distribution of the number of proteins without solving recurrence equations, by using an alternative method which relies on some nice properties of point processes. See Robert [9]. The method is presented in the next section. An extension, see Fromion et al. [2], which uses the mathematical approach developed in this paper, considers a finer and more complete description of gene expression. In particular, it includes the dilution process during the exponential growth phase.

1.6. Outline of the Paper. Section 2 introduces the marked Poisson point processes used in the mathematical modeling of the production of proteins. In this model the lifetime of an mRNA or of a protein has a general distribution, instead of the exponential distribution assumed in the models of the literature. Appendix A briefly recalls the main results concerning this class of point processes. Section 3 gives the main results concerning the distribution of the number of mRNAs at equilibrium; the main tools in this analysis are the representation in terms of marked Poisson point processes and a coupling argument. Section 4 is devoted to the derivation of an explicit formula for the variance of the number of proteins. Several examples of distributions are discussed.

2. Stochastic Model 

In this section, the various stochastic processes are introduced. In the appen- 
dix we recall the main results and notations concerning the marked Poisson point 
processes (MPPP) which are used in this paper. 

Gene activation. It is assumed that there is one gene, which is activated at rate $\lambda_1^+$ and inactivated at rate $\lambda_1^-$. Recall that the assumption that $n_{\max}$, the maximum number of active genes, is 1 does not restrict the generality of our results, since the quantities analyzed in this paper (expected values and variances) are proportional to $n_{\max}$. Let $(E_n)$ and $(F_n)$ be i.i.d. exponential random variables with respective rates $\lambda_1^-$ and $\lambda_1^+$. The process of activation of the gene at equilibrium can be represented as a stationary process $(Y(t), t \in \mathbb{R})$ with values in $\{0, 1\}$. Note that $(Y(t))$ is defined on the whole real line, i.e. that the activation/deactivation



STOCHASTIC GENE EXPRESSION IN CELLS 



process has started at $t = -\infty$. As will be seen, this is a convenient representation to describe properly the equilibrium of the protein production process. The increasing sequence of the instants of activation of the gene is denoted by $(t_n)$, with the convention that $t_0 \le 0 < t_1$. In particular

$$\{t_n,\ n \in \mathbb{Z}\} = \{s \in \mathbb{R} : Y(s-) = 0 \text{ and } Y(s) = 1\}$$

and $t_{n+1} - t_n = E_n + F_n$. Because of our assumptions, $(t_n)$ is a stationary renewal point process.




Figure 1. Three-stage model. The gene activation and deactivation occur at rates $\lambda_1$ and $\mu_1$ respectively (denoted $\lambda_1^+$ and $\lambda_1^-$ in the text). Transcription and translation occur at rates $\lambda_2$ and $\lambda_3$ respectively. The degradation times of mRNAs and proteins have probability distributions $F_2(dt)$ and $F_3(dt)$ respectively.



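Production of mRNAs. When the gene is active, it produces mRNAs at rate $\lambda_2$, and $F_2(dy)$ is the distribution of the lifetime of an mRNA. Let $\mathcal{N}_{\lambda_2} = (s_n, \sigma_{2,n})$ be a MPPP on $\mathbb{R} \times \mathbb{R}_+$ with intensity measure $\lambda_2\, dx \otimes F_2(dy)$. If the gene is always active then, for $s \le t$, the formula

$$\mathcal{N}_{\lambda_2}([s, t] \times \mathbb{R}_+) = \sum_{n \in \mathbb{Z}} \mathbb{1}_{\{s \le s_n \le t\}} = \int \mathbb{1}_{\{s \le u \le t\}}\, \mathcal{N}_{\lambda_2}(du, dv)$$

represents the total number of mRNAs created between time $s$ and time $t$, and

$$\sum_{n \in \mathbb{Z}} \mathbb{1}_{\{s \le s_n \le t < s_n + \sigma_{2,n}\}} = \int \mathbb{1}_{\{s \le u \le t < u + v\}}\, \mathcal{N}_{\lambda_2}(du, dv)$$

is the number of those mRNAs still alive at time $t$. More generally, if we include the gene dynamics into the formula, we find that the number of messengers created in the time interval $[s, t]$ and still alive at time $t$ is

$$\sum_{n \in \mathbb{Z}} \mathbb{1}_{\{s \le s_n \le t < s_n + \sigma_{2,n},\ Y(s_n) = 1\}} = \int \mathbb{1}_{\{s \le u \le t < u + v,\ Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv).$$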
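
The counting formulas above translate directly into a simulation. The following sketch assumes that the gene is always active and uses hypothetical names (`alive_at`, `sample_lifetime`): creation times are the points of a Poisson process of rate `lam` on the interval from `s` to `t`, each point carries an independent lifetime as its mark, and a point is counted if it is still alive at time `t`.

```python
import random


def alive_at(lam, sample_lifetime, s, t, rng):
    """Number of marked points (u, v) with s <= u <= t < u + v, for a Poisson
    process of rate lam on [s, t] with i.i.d. marks drawn by sample_lifetime."""
    alive = 0
    u = s + rng.expovariate(lam)        # first creation time after s
    while u <= t:
        v = sample_lifetime(rng)        # mark: lifetime of this mRNA
        if u + v > t:
            alive += 1
        u += rng.expovariate(lam)       # next creation time
    return alive


def mean_and_variance(lam, sample_lifetime, s, t, n_runs=2000, seed=3):
    """Empirical mean and variance of the number alive at time t."""
    rng = random.Random(seed)
    counts = [alive_at(lam, sample_lifetime, s, t, rng) for _ in range(n_runs)]
    mean = sum(counts) / n_runs
    var = sum((c - mean) ** 2 for c in counts) / n_runs
    return mean, var
```

With a deterministic lifetime, e.g. `mean_and_variance(5.0, lambda rng: 2.0, -50.0, 0.0)`, the empirical mean and variance should both be close to the rate times the mean lifetime, as expected for a Poisson distribution.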
Production of Proteins. A given mRNA produces proteins at rate $\lambda_3$, and $F_3(dy)$ is the distribution of the lifetime of a protein.

For $u \in \mathbb{R}$, denote by $\mathcal{N}^u_{\lambda_3}$ a MPPP with intensity $\lambda_3\, dx \otimes F_3(dy)$. In the following, it is the process of creation of proteins associated to an mRNA created at time $u$. In particular, if the lifetime of this mRNA is $v$, then

$$\mathcal{N}^u_{\lambda_3}([u, u + v] \times \mathbb{R}_+) = \int_{[u, u + v] \times \mathbb{R}_+} \mathcal{N}^u_{\lambda_3}(dx, dy)$$

is the total number of proteins created by such an mRNA during its lifetime.
Remarks.

Here the mRNA is available for translation once a small portion of the growing mRNA chain has been assembled. This assumption is consistent with the prokaryotic dynamics, but should be adapted for the eukaryotic case. Indeed, in that case one has to wait for the completed messenger to be exported to the cytoplasm. If we assume this time to be deterministic, then the previously defined integral should be shifted by a constant value, and the corresponding analytic results are easily obtained.

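The whole process of production of mRNAs and proteins can thus be described by the sequence

$$\mathcal{N} = \left(s_n,\ \sigma_{2,n},\ \mathcal{N}^{s_n}_{\lambda_3}\right).$$

Recall that $\mathcal{N}^u_{\lambda_3} : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathcal{M}_p(\mathbb{R} \times \mathbb{R}_+)$, where $\mathcal{M}_p(\mathbb{R} \times \mathbb{R}_+)$ is the set of point processes on $\mathbb{R} \times \mathbb{R}_+$. If we denote by $Q$ the distribution of $\mathcal{N}^0_{\lambda_3}$ on $\mathcal{M}_p(\mathbb{R} \times \mathbb{R}_+)$, the process $\mathcal{N}$ can be seen as a marked Poisson point process on $\mathbb{R}_+ \times \mathcal{M}_p(\mathbb{R} \times \mathbb{R}_+)$ with intensity measure $F_3(dx) \otimes Q$. This observation will not be used in the following, to keep the setting as simple as possible, but the proof of Proposition 2 below could be shortened by using it together with Proposition 4.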

The notations and some definitions for the stochastic models used in this paper are now summarized.

Notations.

- Gene activation.
  The activation rate [resp. inactivation rate] is $\lambda_1^+$ [resp. $\lambda_1^-$] and

  $$\delta_+ = \frac{\lambda_1^+}{\Lambda} \quad \text{and} \quad \Lambda = \lambda_1^+ + \lambda_1^-.$$

- mRNA production.
  The rate of production of mRNAs by an active gene is $\lambda_2$, $F_2(dx)$ is the distribution of an mRNA lifetime, $\sigma_2$ denotes a random variable with distribution $F_2$ and

  $$\rho_2 \stackrel{\text{def}}{=} \lambda_2\, \mathbb{E}(\sigma_2) = \lambda_2 \int x\, F_2(dx).$$

- Protein production.
  The rate of production of proteins by an mRNA is $\lambda_3$, the lifetime distribution of a protein is $F_3(dx)$, $\sigma_3$ denotes a random variable with distribution $F_3$ and

  $$\rho_3 \stackrel{\text{def}}{=} \lambda_3\, \mathbb{E}(\sigma_3) = \lambda_3 \int x\, F_3(dx).$$






3. Equilibrium Distribution of the Number of mRNAs 

This section investigates the first part of the protein production process: activation of the relevant gene and production of mRNAs.

3.1. State of the Gene. The behavior of the process $(Y(t))$ is well known. Once equilibrium has been reached,

$$\mathbb{P}(Y(0) = 1) = \delta_+ = \frac{\lambda_1^+}{\lambda_1^+ + \lambda_1^-} = 1 - \mathbb{P}(Y(0) = 0).$$

To express the variance of the number of proteins, the following quantity is required: for $t \ge 0$,

$$\mathbb{P}(Y(t) = 1 \mid Y(0) = 1) = \delta_+ + (1 - \delta_+)\, e^{-\Lambda t}, \tag{2}$$

with $\Lambda = \lambda_1^+ + \lambda_1^-$. See Norris [1] and Peccoud and Ycart [6] for detailed computations. From now on, it will be assumed that $(Y(t))$ is defined on $\mathbb{R}$ and is at equilibrium.
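
The transition probability of the two-state gene process can be checked numerically. The sketch below, with hypothetical function names, assumes the standard forward equation of a two-state Markov chain, $p'(t) = \lambda_1^+ (1 - p(t)) - \lambda_1^- p(t)$ with $p(0) = 1$, where $p(t)$ is the probability that the gene is active at time $t$ given that it is active at time 0. It integrates this equation with a classical Runge-Kutta scheme and compares the result with the closed form $\delta_+ + (1 - \delta_+) e^{-\Lambda t}$.

```python
import math


def p_on_formula(t, lam_plus, lam_minus):
    """Closed form: delta_+ + (1 - delta_+) * exp(-Lambda * t)."""
    big_lambda = lam_plus + lam_minus
    delta_plus = lam_plus / big_lambda
    return delta_plus + (1.0 - delta_plus) * math.exp(-big_lambda * t)


def p_on_numeric(t, lam_plus, lam_minus, n_steps=1000):
    """Integrate p' = lam_plus * (1 - p) - lam_minus * p, p(0) = 1, with RK4."""
    def f(p):
        return lam_plus * (1.0 - p) - lam_minus * p

    h = t / n_steps
    p = 1.0
    for _ in range(n_steps):
        k1 = f(p)
        k2 = f(p + 0.5 * h * k1)
        k3 = f(p + 0.5 * h * k2)
        k4 = f(p + h * k3)
        p += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return p
```

Both functions should agree to high precision, and for large `t` the probability tends to `lam_plus / (lam_plus + lam_minus)`, the equilibrium probability that the gene is active.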

3.2. Number of mRNAs. A result on the number of mRNAs at equilibrium and 
its distribution is derived in this section. The techniques used to prove it will also 
be used to investigate the distribution of the number of proteins in the next section. 
In order to present the MPPP approach, we will develop computations for mRNAs, 
since they are simpler from the point of view of notations, but include the main 
ideas. 

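Proposition 1. The number $M$ of mRNAs at equilibrium can be represented as

$$M = \int_{\mathbb{R} \times \mathbb{R}_+} \mathbb{1}_{\{u \le 0 < u + v,\ Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv), \tag{3}$$

where $\mathcal{N}_{\lambda_2}$ is a marked Poisson point process with intensity $\lambda_2\, dx \otimes F_2(dy)$.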

Proof. Suppose there are no mRNAs at the starting time 0; then the number $M_t$ of mRNAs at time $t$ is given by

$$M_t = \sum_{n \ge 1} \mathbb{1}_{\{0 \le s_n \le t < s_n + \sigma_{2,n},\ Y(s_n) = 1\}} = \int_{v \in \mathbb{R}_+} \int_{u = 0}^{t} \mathbb{1}_{\{u \le t < u + v,\ Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv),$$

if $\mathcal{N}_{\lambda_2} = (s_n, \sigma_{2,n})$ as defined in Section 2. Recall that $s_n$ is the (potential) $n$th binding time of a polymerase on the gene: an mRNA is created only if the gene is active, i.e. $Y(s_n) = 1$. The term $\sigma_{2,n}$ represents the lifetime of the newly produced mRNA. The right-hand side of the previous equation accounts for the number of mRNAs produced in the interval $[0, t]$ and still alive at time $t$ ($u + v > t$).

Since the process $(Y(t))$ is stationary, as well as the marked Poisson point process, they are both invariant by translation. By translating by $-t$, one gets that $M_t$ has the same distribution as

$$M_t \stackrel{\text{dist.}}{=} \int_{v \in \mathbb{R}_+} \int_{u = -t}^{0} \mathbb{1}_{\{0 < u + v,\ Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv);$$

by letting $t$ go to infinity, one obtains the desired result. □
Remark

It is crucial that the distribution of $M_t$ can be explicitly expressed as a functional of the marked Poisson process $\mathcal{N}_{\lambda_2}$. The same property is true for its limit. In this context, with the help of the coupling argument, there is no need of a Markovian setting to prove that $M_t$ converges in distribution as $t$ goes to infinity. As will be seen, the distribution of the limit $M$ can be obtained by using some properties of Poisson point processes. For all these reasons, there is no need to require the random variables $\sigma_2$ and $\sigma_3$ to be exponentially distributed.

In the proof of the above result, we have in fact proved a more general result. 

Theorem 1. The point process $\mathcal{M}$ representing the instants of creation of mRNAs and the associated lifetimes at equilibrium can be represented as

$$\mathcal{M} = \int_{\mathbb{R} \times \mathbb{R}_+} \mathbb{1}_{\{Y(u) = 1\}}\, \delta_{(u, v)}\, \mathcal{N}_{\lambda_2}(du, dv), \tag{4}$$

where $\delta_z$ is the Dirac mass at $z$.

The number of mRNAs alive at equilibrium can thus be represented as

$$M = \int \mathbb{1}_{\{u \le 0 < u + v\}}\, \mathcal{M}(du, dv) = \int \mathbb{1}_{\{u \le 0 < u + v,\ Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv),$$

which is precisely the expression of Proposition 1. When the activation rate of the gene goes to infinity, the point process $\mathcal{M}$ is simply a marked Poisson point process and $M$ has a Poisson distribution with parameter $\rho_2 = \lambda_2\, \mathbb{E}(\sigma_2)$.

We now use this representation to get an explicit expression of the variance of 
the number of mRNAs at equilibrium. 

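Proposition 2. If the distribution of the lifetime of an mRNA is $F_2(dx)$, the average of the number $M$ of mRNAs at equilibrium is given by

$$\mathbb{E}(M) = \delta_+ \rho_2 = \frac{\lambda_1^+}{\lambda_1^+ + \lambda_1^-}\, \lambda_2 \int x\, F_2(dx).$$

The variance of $M$ is

$$\mathrm{var}(M) = \mathbb{E}(M) + 2 \rho_2^2\, \delta_+ (1 - \delta_+) \int_0^{+\infty}\!\!\int_0^{+\infty} e^{-\Lambda v}\, \overline{F}_2(u)\, \overline{F}_2(u + v)\, du\, dv, \tag{5}$$

where $F_2(x) = F_2([0, x])$ and $\overline{F}_2(x) = (1 - F_2(x)) / \mathbb{E}(\sigma_2)$.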

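Proof. Conditionally on the process $(Y(t))$, $M$ follows a Poisson distribution; hence, for $z \in [0, 1]$,

$$\begin{aligned}
\mathbb{E}\left(z^M \mid (Y(t))\right) &= \exp\left(-\lambda_2 (1 - z) \int_{\mathbb{R}_+} \int_{-\infty}^{0} \mathbb{1}_{\{Y(u) = 1,\ u + v > 0\}}\, du\, F_2(dv)\right) \\
&= \exp\left(-\lambda_2 (1 - z) \int_0^{+\infty} \mathbb{1}_{\{Y(-u) = 1\}}\, \mathbb{P}(\sigma_2 > u)\, du\right),
\end{aligned} \tag{6}$$

by taking $f(u, v) = \mathbb{1}_{\{Y(u) = 1,\ u \le 0,\ u + v > 0\}}$ in Relation (12). If we differentiate formula (6) with respect to $z$ and take $z = 1$, we obtain

$$\mathbb{E}(M \mid (Y(t))) = \lambda_2 \int_{-\infty}^{0} \mathbb{1}_{\{Y(u) = 1\}}\, \mathbb{P}(\sigma_2 > -u)\, du.$$

Since $(Y(t))$ is at equilibrium, $\mathbb{P}(Y(u) = 1) = \delta_+$; hence, integrating the last relation, we get

$$\mathbb{E}(M) = \delta_+ \lambda_2 \int_{-\infty}^{0} \mathbb{P}(\sigma_2 > -u)\, du = \delta_+ \lambda_2\, \mathbb{E}(\sigma_2).$$

If we differentiate Formula (6) twice and substitute $z = 1$, we obtain

$$\begin{aligned}
\mathbb{E}(M(M - 1) \mid (Y(t))) &= \lambda_2^2 \left(\int_0^{+\infty} \mathbb{1}_{\{Y(-u) = 1\}}\, \mathbb{P}(\sigma_2 > u)\, du\right)^2 \\
&= \lambda_2^2 \int_{\mathbb{R}_+^2} \mathbb{1}_{\{Y(-u) = 1,\ Y(-v) = 1\}}\, \mathbb{P}(\sigma_2 > u)\, \mathbb{P}(\sigma_2 > v)\, du\, dv,
\end{aligned}$$

which, integrated with respect to $(Y(t))$, gives

$$\mathbb{E}(M^2) - \mathbb{E}(M) = \lambda_2^2 \int_{\mathbb{R}_+^2} \mathbb{P}(Y(-u) = 1, Y(-v) = 1)\, \mathbb{P}(\sigma_2 > u, \sigma_2' > v)\, du\, dv,$$

where the random variable $\sigma_2'$ is independent of $\sigma_2$ and has the same distribution. Using relation (2), for $u \le v$ and $\Lambda = \lambda_1^+ + \lambda_1^-$, we get

$$\begin{aligned}
\mathbb{P}(Y(-u) = 1, Y(-v) = 1) &= \mathbb{P}(Y(-v) = 1)\, \mathbb{P}(Y(-u) = 1 \mid Y(-v) = 1) \\
&= \delta_+ \left(\delta_+ + (1 - \delta_+)\, e^{-\Lambda (v - u)}\right).
\end{aligned}$$

Therefore $\mathbb{E}(M^2) - \mathbb{E}(M)$ is the sum of

$$\lambda_2^2 \delta_+^2 \int_{\mathbb{R}_+^2} \mathbb{P}(\sigma_2 > u, \sigma_2' > v)\, du\, dv = (\lambda_2 \delta_+ \mathbb{E}(\sigma_2))^2 = (\mathbb{E}(M))^2$$

and, up to the multiplicative factor $2 \lambda_2^2 \delta_+ (1 - \delta_+)$, of

$$\int_{\mathbb{R}_+^2} \mathbb{P}(\sigma_2 > u, \sigma_2' > v)\, e^{-\Lambda (v - u)}\, \mathbb{1}_{\{u \le v\}}\, du\, dv.$$

The change of variables $v \to u + v$ turns this last integral into $\mathbb{E}(\sigma_2)^2$ times the integral of $e^{-\Lambda v}\, \overline{F}_2(u)\, \overline{F}_2(u + v)$ over $\mathbb{R}_+^2$, which gives Relation (5). The proposition is proved. □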

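Normalized variance. By Relation (5), the normalized variance of $M$ is given by

$$\frac{\mathrm{var}(M)}{\mathbb{E}(M)^2} = \frac{1}{\mathbb{E}(M)} + 2\, \frac{1 - \delta_+}{\delta_+} \int_0^{+\infty}\!\!\int_0^{+\infty} e^{-\Lambda v}\, \overline{F}_2(u)\, \overline{F}_2(u + v)\, du\, dv.$$

When the mean $\mathbb{E}(M)$ is fixed, the only quantity which depends on the distribution of the lifetime of an mRNA is the integral

$$I_{F_2} = \int_0^{+\infty}\!\!\int_0^{+\infty} e^{-\Lambda v}\, \overline{F}_2(u)\, \overline{F}_2(u + v)\, du\, dv.$$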

To conclude this section, we now apply the previous general formulas to specific choices of the probability distribution. In particular, we obtain analytical formulas for the previous integral for exponential and deterministic distributions. These assumptions are not completely realistic from a biological point of view; nevertheless, they are used to stress the impact of the probability distribution on the messenger variance. If the distribution of the lifetime of an mRNA is the exponential distribution $E_{\mu_2}$ with parameter $\mu_2$, one gets

$$I_{E_{\mu_2}} = \frac{\mu_2}{2 (\Lambda + \mu_2)}.$$

If the lifetime of an mRNA has the deterministic distribution $D_{\mu_2}$ with a unit mass at $1/\mu_2$, the above formula yields

$$I_{D_{\mu_2}} = \frac{\mu_2}{\Lambda} \left(1 - \frac{\mu_2}{\Lambda} \left(1 - e^{-\Lambda / \mu_2}\right)\right).$$

Straightforward calculations with these formulas show that $I_{E_{\mu_2}} < I_{D_{\mu_2}}$. The ratio $I_{D_{\mu_2}} / I_{E_{\mu_2}}$ varies in fact between 1 and 2, see Figure 2. The variance for the exponential distribution is smaller than the one for the deterministic distribution with the same mean. This result is not quite intuitive if one takes into account that the variance of the exponential distribution is quite large.



Figure 2. Ratio of the variances of the number of mRNAs: deterministic/exponential.
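
The comparison between the exponential and deterministic cases can be reproduced numerically. The sketch below is an illustration with hypothetical function names. It assumes the closed forms obtained from the definition of the integral with $\overline{F}_2(u) = \mu_2 e^{-\mu_2 u}$ in the exponential case and $\overline{F}_2(u) = \mu_2 \mathbb{1}_{\{u < 1/\mu_2\}}$ in the deterministic case, and cross-checks them with a midpoint-rule approximation of the double integral truncated at `upper`.

```python
import math


def i_exponential(big_lambda, mu):
    """Integral for an exponential lifetime with rate mu."""
    return mu / (2.0 * (big_lambda + mu))


def i_deterministic(big_lambda, mu):
    """Integral for a deterministic lifetime equal to 1/mu."""
    x = big_lambda / mu
    # (x - 1 + exp(-x)) / x**2, written with expm1 for small x.
    return (x + math.expm1(-x)) / (x * x)


def i_numeric(big_lambda, fbar, upper, n=600):
    """Midpoint-rule approximation of the double integral of
    exp(-big_lambda * v) * fbar(u) * fbar(u + v) over [0, upper]^2."""
    h = upper / n
    grid = [(k + 0.5) * h for k in range(n)]
    total = 0.0
    for v in grid:
        weight = math.exp(-big_lambda * v)
        total += weight * sum(fbar(u) * fbar(u + v) for u in grid)
    return total * h * h


def ratio(big_lambda, mu):
    """Ratio deterministic/exponential of the two integrals."""
    return i_deterministic(big_lambda, mu) / i_exponential(big_lambda, mu)
```

Evaluating `ratio(x, 1.0)` for increasing `x` gives values growing from about 1 towards 2, in agreement with the range stated in the text.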



4. Variance of the Number of Proteins at equilibrium 

Recall that if an mRNA is created at time $u$ and has a lifetime $v$, then on the time interval $[u, u + v]$ proteins are created according to the marked Poisson point process $\mathcal{N}^u_{\lambda_3}$ with intensity $\lambda_3\, dx \otimes F_3(dy)$. The instants of creation of proteins together with their lifetimes can thus be represented by the following point process

$$\mathcal{P} = \int_{\mathbb{R} \times \mathbb{R}_+} \mathcal{M}(du, dv) \int_{[u, u + v] \times \mathbb{R}_+} \delta_{(x, y)}\, \mathcal{N}^u_{\lambda_3}(dx, dy), \tag{7}$$

where $\mathcal{M}$ is the point process defined by formula (4).

Proposition 3. The number $P$ of proteins at equilibrium can be represented by the random variable

$$P = \int_{\mathbb{R} \times \mathbb{R}_+} \mathbb{1}_{\{Y(u) = 1\}}\, \mathcal{N}_{\lambda_2}(du, dv) \int_{\mathbb{R} \times \mathbb{R}_+} \mathbb{1}_{\{x \le 0 < x + y,\ u \le x \le u + v\}}\, \mathcal{N}^u_{\lambda_3}(dx, dy). \tag{8}$$

Proof. The derivation is quite straightforward. If an mRNA alive between time $u$ and $u + v$ generates a protein at time $x$ with lifetime $y$, this protein will be present at time 0 if $x \le 0 < x + y$. The argument that this is indeed the representation of the number of proteins at equilibrium follows the same lines as the proof of Proposition 1. □






Before the main technical result of the paper, we can get information on the
distribution of $P$ using formula (8). We start with the simple case of the mean.



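For fixed $u \in \mathbb{R}$ and $v \in \mathbb{R}_+$, formula (13) gives

$$E\left(\int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v\}}\,\mathcal{N}_{\lambda_3}(dx,dy)\right) = \lambda_3 \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v\}}\,dx\,F_3(dy).$$

Integrating this expression with respect to $1_{\{Y(u)=1\}}\,\mathcal{N}_{\lambda_2}(du,dv)$ and taking its
expectation, we get

$$\begin{aligned}
E[P \mid (Y(t))] &= \lambda_3\,E\left[\int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{Y(u)=1\}} \left(\int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v\}}\,dx\,F_3(dy)\right) \mathcal{N}_{\lambda_2}(du,dv) \,\Big|\, (Y(t))\right] \\
&= \lambda_3 \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{Y(u)=1\}} \left(\int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v\}}\,dx\,F_3(dy)\right) \lambda_2\,du\,F_2(dv) \\
&= \lambda_2\lambda_3 \int_{\mathbb{R}_-^2} 1_{\{Y(u+x)=1\}}\,P(\sigma_2 > -u)\,P(\sigma_3 > -x)\,dx\,du,
\end{aligned}$$

where we used again formula (13). A further integration finally gives the expectation

$$\begin{aligned}
E(P) &= \lambda_2\lambda_3 \int_{\mathbb{R}_-^2} P(Y(u+x)=1)\,P(\sigma_2 > -u)\,P(\sigma_3 > -x)\,dx\,du \\
&= \lambda_2\lambda_3\,\delta_+ \int_{\mathbb{R}_-^2} P(\sigma_2 > -u)\,P(\sigma_3 > -x)\,dx\,du = \delta_+\,\lambda_2 E(\sigma_2)\,\lambda_3 E(\sigma_3),
\end{aligned}$$

with the notation introduced in Proposition 2.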
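
The last equality rests on the identity $\int_0^{+\infty} P(\sigma > t)\,dt = E(\sigma)$ applied to each lifetime. The following sketch checks it numerically for an exponential mRNA lifetime and a deterministic protein lifetime. The mean lifetimes 172 and 1000, the rate 0.02 and the target mean 300 are the values quoted later in the text; the value of `delta_plus`, the derived `lam3` and all function names are hypothetical choices made only for this illustration.

```python
import math

def tail_integral(survival, upper, n=20_000):
    """Midpoint approximation of the integral of t -> P(sigma > t) over [0, upper]."""
    h = upper / n
    return h * sum(survival((k + 0.5) * h) for k in range(n))

def mean_proteins(delta_plus, lam2, lam3, mean_sigma2, mean_sigma3):
    """E(P) = delta_+ * lambda_2 E(sigma_2) * lambda_3 E(sigma_3)."""
    return delta_plus * lam2 * mean_sigma2 * lam3 * mean_sigma3

# Survival functions: exponential mRNA lifetime (mean 172),
# deterministic protein lifetime (equal to 1000).
surv_mrna = lambda t: math.exp(-t / 172.0)
surv_protein = lambda t: 1.0 if t < 1000.0 else 0.0

mean_sigma2 = tail_integral(surv_mrna, upper=40 * 172.0)  # truncation error ~ e^{-40}
mean_sigma3 = tail_integral(surv_protein, upper=1000.0)

delta_plus = 0.5                                   # hypothetical value
lam2 = 0.02
lam3 = 300.0 / (delta_plus * lam2 * 172.0 * 1000.0)  # chosen so that E(P) = 300
expected_proteins = mean_proteins(delta_plus, lam2, lam3, mean_sigma2, mean_sigma3)
```

Only the means of the two lifetimes enter $E(P)$, so replacing either distribution by another one with the same mean leaves `expected_proteins` unchanged; the distribution matters only from the variance onwards.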

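Theorem 2. If the distribution of the lifetime of an mRNA [resp. protein] is $F_2(dx)$
[resp. $F_3(dy)$], then the expected value of the random variable $P$, which is the
number of proteins at equilibrium, is given by

$$E(P) = \delta_+\,\rho_2\,\rho_3 = \frac{\lambda_1^+}{\lambda_1^+ + \lambda_1^-}\;\lambda_2 \int x\,F_2(dx)\;\lambda_3 \int y\,F_3(dy)$$

and its variance $\mathrm{var}(P)$ can be expressed as

$$\begin{aligned}
\mathrm{var}(P) = E(P) &+ \lambda_2\,\rho_3^2\,\delta_+ \int_0^{+\infty}\!\!\int_0^{+\infty} \left(\int_{(s-t)^+}^{s} \bar{F}_3(u)\,du\right)^{2} ds\,F_2(dt) \\
&+ \rho_2^2\,\rho_3^2\,\delta_+(1-\delta_+) \int_{\mathbb{R}_+^4} e^{-\Lambda\,|(u_1-u_2)+(v_1-v_2)|} \prod_{i=1}^{2} \bar{F}_2(u_i)\,\bar{F}_3(v_i)\,du_i\,dv_i,
\end{aligned} \tag{9}$$

where, for $j = 2, 3$, $F_j(x) = F_j([0,x])$ and $\bar{F}_j(x) = (1 - F_j(x))/E(\sigma_j)$.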
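Proof. Recall that $\mathcal{N}_{\lambda_2}$ can also be represented as $\mathcal{N}_{\lambda_2} = (s_n, t_n)$ and

$$P = \sum_{n\in\mathbb{Z}} 1_{\{Y(s_n)=1\}} \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ s_n<x<s_n+t_n\}}\,\mathcal{N}^{n}_{\lambda_3}(dx,dy).$$

Denote by $\widetilde{E}$ the conditional expectation $E(\cdot \mid (Y(t)), (s_n,t_n))$. The conditional
generating function $\widetilde{E}(z^P)$ can be written as

$$\begin{aligned}
&\widetilde{E}\left(\prod_{n\in\mathbb{Z}} \exp\left(\log(z) \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{Y(s_n)=1\}}\,1_{\{x<0<x+y,\ s_n<x<s_n+t_n\}}\,\mathcal{N}^{n}_{\lambda_3}(dx,dy)\right)\right) \\
&\qquad = \prod_{n\in\mathbb{Z}} \widetilde{E}\left(\exp\left(\log(z) \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{Y(s_n)=1\}}\,1_{\{x<0<x+y,\ s_n<x<s_n+t_n\}}\,\mathcal{N}^{n}_{\lambda_3}(dx,dy)\right)\right)
\end{aligned}$$

since the point processes $\mathcal{N}^{n}_{\lambda_3}$, $n \in \mathbb{Z}$, are independent.

The $n$th term of this product is, applying Proposition 4 to the marked Poisson
point process $\mathcal{N}^{n}_{\lambda_3}$ with the function $-\log(z)$ times the indicator function,

$$\exp\left(-\lambda_3(1-z)\,1_{\{Y(s_n)=1\}} \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ s_n<x<s_n+t_n\}}\,dx\,F_3(dy)\right).$$

By integrating $\widetilde{E}(z^P)$ with respect to $\mathcal{N}_{\lambda_2}$, the generating function can thus be
written as

$$E\left(z^P \mid (Y(t))\right) = E\left(\exp\left(-\int_{\mathbb{R}\times\mathbb{R}_+} g(u,v)\,\mathcal{N}_{\lambda_2}(du,dv)\right) \,\Big|\, (Y(t))\right),$$

where

$$g(u,v) = \lambda_3(1-z)\,1_{\{Y(u)=1\}} \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v\}}\,dx\,F_3(dy).$$

Applying again Proposition 4 to the marked Poisson point process $\mathcal{N}_{\lambda_2}$, we get

$$E\left(z^P \mid (Y(t))\right) = \exp\left(-\lambda_2 \int_{\mathbb{R}} du \int_{\mathbb{R}_+} F_2(dv) \left[1 - \exp\left(-\lambda_3(1-z) \int_{\mathbb{R}\times\mathbb{R}_+} 1_{\{x<0<x+y,\ u<x<u+v,\ Y(u)=1\}}\,dx\,F_3(dy)\right)\right]\right).$$

In order to obtain an expression for $E(P(P-1) \mid (Y(t)))$, we have to differentiate
the previous formula twice with respect to $z$ and evaluate it at $z = 1$. The resulting
formula should then be integrated with respect to $(Y(t))$, and we get formula
(9) by using arguments similar to those in the proof of Proposition 2 (with more technical
calculations). □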



Applications. 

To show the effectiveness of the analytic formula (9) for the protein variance, one
considers the cases of exponential and deterministic distributions. More realistic
cases are also considered, see the figures below. This specific analysis will give an indication
of the impact of the distribution on the protein variance. In each case the average
lifetime of an mRNA [resp. protein] is $1/\mu_2$ [resp. $1/\mu_3$]. Recall that $\delta_+ = \lambda_1^+/\Lambda$
and $\Lambda = \lambda_1^+ + \lambda_1^-$. As in the case of mRNAs above, even if these assumptions are
not completely realistic from a biological point of view, this analysis shows the impact
of the distribution on the variance, and therefore the necessity of having closed
form expressions for a large set of distributions.

Exponential Distribution. 

If the distribution of the lifetime of an mRNA [resp. protein] is exponential with
parameter $\mu_2$ [resp. $\mu_3$], then formula (9) gives the classical result on the variance,
see Paulsson [5],

$$\mathrm{var}_E(P) = E(P)\left(1 + \frac{\lambda_3}{\mu_2+\mu_3} + \frac{\lambda_2\lambda_3(1-\delta_+)(\Lambda+\mu_2+\mu_3)}{(\mu_2+\mu_3)(\Lambda+\mu_2)(\Lambda+\mu_3)}\right). \tag{10}$$
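
As an illustration, the exponential case can be evaluated directly. The sketch below assumes the standard form of the classical three-stage result, $E(P)\,(1 + \lambda_3/(\mu_2+\mu_3) + \lambda_2\lambda_3(1-\delta_+)(\Lambda+\mu_2+\mu_3)/((\mu_2+\mu_3)(\Lambda+\mu_2)(\Lambda+\mu_3)))$; the function names and the numerical values are hypothetical and are not taken from the experiments of the paper.

```python
def protein_mean_exp(lam1_plus, lam1_minus, lam2, lam3, mu2, mu3):
    """E(P) = delta_+ * (lambda_2/mu_2) * (lambda_3/mu_3) for exponential lifetimes."""
    delta_plus = lam1_plus / (lam1_plus + lam1_minus)
    return delta_plus * (lam2 / mu2) * (lam3 / mu3)

def protein_variance_exp(lam1_plus, lam1_minus, lam2, lam3, mu2, mu3):
    """Variance of P when all lifetimes are exponential (assumed classical form)."""
    big_lambda = lam1_plus + lam1_minus
    delta_plus = lam1_plus / big_lambda
    mean = protein_mean_exp(lam1_plus, lam1_minus, lam2, lam3, mu2, mu3)
    # Contribution of the randomness of the number of mRNAs
    translation_term = lam3 / (mu2 + mu3)
    # Contribution of the on/off switching of the gene
    gene_term = (lam2 * lam3 * (1.0 - delta_plus) * (big_lambda + mu2 + mu3)
                 / ((mu2 + mu3) * (big_lambda + mu2) * (big_lambda + mu3)))
    return mean * (1.0 + translation_term + gene_term)

# Hypothetical parameter values, for illustration only
params = dict(lam1_plus=0.01, lam1_minus=0.03, lam2=0.02, lam3=0.2,
              mu2=1.0 / 172.0, mu3=1.0 / 1000.0)
mean_P = protein_mean_exp(**params)
var_P = protein_variance_exp(**params)
fano_factor = var_P / mean_P
```

When the gene is always active (`lam1_minus = 0`, so that $\delta_+ = 1$) the last term vanishes and the ratio of the variance to the mean reduces to $1 + \lambda_3/(\mu_2+\mu_3)$.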



Deterministic Case. 

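If the lifetime of an mRNA is exponentially distributed with parameter $\mu_2$ and the
protein lifetime is deterministic, equal to $1/\mu_3$, then formula (9) gives the identity

$$\mathrm{var}_D(P) = E(P)\left[1 + \frac{2\lambda_3}{\mu_2}\left(1 - \frac{\mu_3}{\mu_2}\left(1 - e^{-\mu_2/\mu_3}\right)\right) + \frac{2\lambda_2\lambda_3(1-\delta_+)\,\mu_3}{\Lambda^2 - \mu_2^2}\left(\frac{\Lambda}{\mu_2}\left(\frac{1}{\mu_3} - \frac{1 - e^{-\mu_2/\mu_3}}{\mu_2}\right) - \frac{\mu_2}{\Lambda}\left(\frac{1}{\mu_3} - \frac{1 - e^{-\Lambda/\mu_3}}{\Lambda}\right)\right)\right]. \tag{11}$$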



As can be seen, the resulting expression for the variance is explicit but intricate;
we therefore present some numerical experiments based on this formula.
Figures 3, 4 and 5 consider the case when the average number of proteins at
equilibrium is fixed and equal to 300, that $\lambda_2 = 0.02$, $\lambda_1^+ = 0.01$ and that the
average lifetime of an mRNA [resp. protein] is 172 [resp. 1000]. We have
considered several possible choices for the distribution $F_3$; it is assumed that all
the other distributions are exponential. The parameter S of the Gaussian is its 
variance. 




Figure 3. Square Root of Relative Variance of Nb of Proteins 
with a fixed mean 

Figure 4. Square Root of Ratio of Variances of Nb of Proteins
with a fixed mean

Figure 5. Square Root of Relative Variance of Nb of Proteins
with a fixed mean



Appendix A. A Reminder on Marked Poisson Processes 

The main results concerning Poisson processes seen as marked point processes
are briefly recalled. See Kingman [3] and Chapter 1 of Robert [5] for a more detailed
account. Throughout this section $H$ is the space $\mathbb{R}^d$ for some $d \geq 1$.

Definition 1. If $\lambda > 0$ and $\mu$ is a probability distribution on $H$, a marked Poisson
process on $\mathbb{R}_+ \times H$ with intensity $\lambda\,dx \otimes \mu$ is a sequence $\mathcal{N}_\lambda = (t_n, X_n)$ of elements
of $\mathbb{R}_+ \times H$ where

- $(t_n)$ is a (classical) Poisson process on $\mathbb{R}_+$ with rate $\lambda$.
- $(X_n)$ is an i.i.d. sequence with values in $H$ and whose distribution is $\mu$.

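The sequence $\mathcal{N}_\lambda$ can also be seen as a marked point process on $\mathbb{R}_+ \times H$, i.e. if
$f : \mathbb{R}_+ \times H \to \mathbb{R}_+$ is a continuous function then

$$\mathcal{N}_\lambda(f) = \int_{\mathbb{R}_+ \times H} f(u,x)\,\mathcal{N}_\lambda(du,dx) = \sum_{n \geq 1} f(t_n, X_n).$$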



In other words, $\mathcal{N}_\lambda$ can also be seen as a sum of Dirac masses at the points $(t_n, X_n)$.
The following important proposition characterizes marked Poisson point processes.

Proposition 4. The point process $\mathcal{N}_\lambda = (t_n, X_n)$ is a marked Poisson point process
with intensity $\lambda\,dx \otimes \mu$ if and only if the relation

$$E\left(\exp\left(-\mathcal{N}_\lambda(f)\right)\right) = \exp\left(-\lambda \int_0^{+\infty}\!\!\int_H \left(1 - e^{-f(u,x)}\right) du\,\mu(dx)\right) \tag{12}$$

holds for any non-negative continuous function $f$ on $\mathbb{R}_+ \times H$.



The left-hand side of Equation (12) is usually defined as the Laplace transform
of $\mathcal{N}_\lambda$ at $f$. This quantity determines completely the distribution of any marked
point process.



For $\xi > 0$, by replacing $f$ by $\xi f$ in Relation (12), one gets an expression for

$$E\left[\exp\left(-\xi\,\mathcal{N}_\lambda(f)\right)\right];$$

if one differentiates it with respect to $\xi$ and sets $\xi = 0$, the above identity gives

$$E\left(\mathcal{N}_\lambda(f)\right) = E\left(\int_{\mathbb{R}_+ \times H} f(u,x)\,\mathcal{N}_\lambda(du,dx)\right) = \lambda \int_{\mathbb{R}_+ \times H} f(u,x)\,du\,\mu(dx). \tag{13}$$
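
Relation (13) can be illustrated by a small simulation. The sketch below is only a Monte Carlo check under choices made for this example: marks with the exponential distribution of parameter 1 on $H = \mathbb{R}$, the test function $f(u,x) = x\,e^{-u}$, a rate $\lambda = 2$ and a truncation of the time axis at a finite horizon. None of these choices or function names come from the paper.

```python
import math
import random

def sample_marked_poisson(rate, horizon, draw_mark, rng):
    """Points (t_n, X_n) of a marked Poisson process restricted to [0, horizon]."""
    points = []
    t = rng.expovariate(rate)            # i.i.d. exponential gaps between the t_n
    while t < horizon:
        points.append((t, draw_mark(rng)))
        t += rng.expovariate(rate)
    return points

def empirical_mean_of_sum(f, rate, horizon, draw_mark, runs, seed=0):
    """Monte Carlo estimate of E(N(f)), the mean of the sum of f(t_n, X_n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        points = sample_marked_poisson(rate, horizon, draw_mark, rng)
        total += sum(f(u, x) for u, x in points)
    return total / runs

rate = 2.0
f = lambda u, x: x * math.exp(-u)             # non-negative and continuous
draw_mark = lambda rng: rng.expovariate(1.0)  # marks with mean 1

# f is negligible beyond u = 30, so truncating the time axis there is harmless.
estimate = empirical_mean_of_sum(f, rate, horizon=30.0, draw_mark=draw_mark, runs=20_000)

# Right-hand side of (13): lambda * (integral of e^{-u} du) * (mean of the mark)
exact = rate * 1.0 * 1.0
```

The estimate fluctuates around `exact` with a standard error of order $1/\sqrt{\text{runs}}$, so the agreement is statistical rather than exact.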